 fusion method


Projection-Manifold Regularized Latent Diffusion for Robust General Image Fusion

Neural Information Processing Systems

This study proposes PDFuse, a robust, general training-free image fusion framework built on pre-trained latent diffusion models with projection-manifold regularization. By redefining fusion as a diffusion inference process constrained by multiple source images, PDFuse can adapt to varied image modalities and produce high-fidelity outputs utilizing the diffusion prior. To ensure both source consistency and full utilization of generative priors, we develop novel projection-manifold regularization, which consists of two core mechanisms. On the one hand, the Multisource Information Consistency Projection (MICP) establishes a projection system between diffusion latent representations and source images, solved efficiently via conjugate gradients to inject multi-source information into the inference. On the other hand, the Latent Manifold-preservation Guidance (LMG) aligns the latent distribution of diffusion variables with that of the sources, guiding generation to respect the model's manifold prior.
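The abstract states that the MICP projection system is solved with conjugate gradients but does not give the system itself. As a rough illustration of the solver only, the sketch below runs a textbook conjugate-gradient iteration on a small regularized least-squares system. The matrix `H`, vector `y`, and weight `lam` are made-up stand-ins, not quantities from PDFuse.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = b.shape[0]
    max_iter = 10 * n if max_iter is None else max_iter
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual of the current estimate
    p = r.copy()             # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break            # residual norm is small enough
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p    # next A-conjugate direction
        rs_old = rs_new
    return x

# Hypothetical toy system (H^T H + lam I) z = H^T y; not the PDFuse operator.
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
lam = 0.1
A = H.T @ H + lam * np.eye(5)
b = H.T @ y
z = conjugate_gradient(A, b)
```

Conjugate gradients only need matrix-vector products with `A`, which is one plausible reason to prefer them when the operator involves a large decoder or projection that is never formed explicitly.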




Probabilistic Fusion and Calibration of Neural Speaker Diarization Models

arXiv.org Artificial Intelligence

End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
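To make the Fuse-then-Calibrate ordering concrete, here is a minimal sketch under several assumptions the abstract does not confirm: fusion is a plain mean of frame-level probabilities, calibration is a Platt-style affine map on logits with hand-picked parameters, and the multilabel-to-powerset conversion treats the two speakers as independent. All function names and numbers are illustrative.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0) at exact 0 or 1
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_mean(prob_list):
    """Average frame-level probabilities (frames x speakers) across models."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def calibrate(p, scale=1.0, bias=0.0):
    """Affine calibration on logits; scale and bias would be fit on held-out data."""
    return sigmoid(scale * logit(p) + bias)

def multilabel_to_powerset(p):
    """Two-speaker probabilities -> [silence, spk1 only, spk2 only, overlap].

    Assumes the two speaker activities are independent (a simplification).
    """
    p1, p2 = p[:, 0], p[:, 1]
    return np.stack(
        [(1 - p1) * (1 - p2), p1 * (1 - p2), (1 - p1) * p2, p1 * p2], axis=1
    )

# Made-up outputs of two models for two frames and two speakers.
model_a = np.array([[0.9, 0.2], [0.4, 0.7]])
model_b = np.array([[0.7, 0.0], [0.6, 0.9]])

fused = fuse_mean([model_a, model_b])                 # fuse first
calibrated = calibrate(fused, scale=1.5, bias=-0.1)   # then calibrate one model
powerset = multilabel_to_powerset(calibrated)
```

With this ordering only one set of calibration parameters has to be fit, which matches the abstract's remark that Fuse-then-Calibrate requires calibrating a single combined model.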


Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting

arXiv.org Artificial Intelligence

Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, are a compelling solution for spacecraft pose estimation, but they are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further support the use of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.

Keywords: event-based pose estimation, rendezvous, domain gap, sensor fusion, close proximity, harsh lighting

1. Introduction

Spacecraft pose estimation is the problem of determining the 6-degrees-of-freedom (6DoF) pose, consisting of the position and orientation, of a space-borne object, typically a satellite. It is a critical step in a wide range of space applications, including rendezvous, close proximity operations, debris removal, refueling and on-orbit servicing [1, 2, 3, 4]. Robust pose estimation is paramount to safely and effectively executing these tasks [5, 6]. Several types of sensor technologies can be employed for spacecraft pose estimation, but they are all subject to size, weight, power and cost (SWaP-C) constraints. Optical sensors such as RGB imaging sensors are favored due to their low SWaP-C requirements, high resolution and the availability of established vision-based algorithms. However, operating in the space environment can present nontrivial challenges to vision-based systems.


MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

arXiv.org Artificial Intelligence

The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.
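The abstract does not give the MILES update rule, so the following is only a hypothetical sketch of the general idea: estimate how much each modality contributes, and slow down the one that dominates. The utilization formula (relative accuracy gain from adding a modality on top of the other), the `threshold`, and the `reduction` factor are all assumptions made for illustration.

```python
def conditional_utilization(acc_multimodal, acc_other_unimodal):
    """Relative accuracy gain from adding one modality on top of the other.

    Assumed definition; requires acc_multimodal > 0.
    """
    return (acc_multimodal - acc_other_unimodal) / acc_multimodal

def balanced_learning_rates(base_lr, acc_multi, acc_a, acc_b,
                            threshold=0.1, reduction=0.5):
    """Return (lr_a, lr_b), shrinking the rate of the dominant modality.

    Hypothetical rule, not the published MILES schedule.
    """
    u_a = conditional_utilization(acc_multi, acc_b)  # what A adds given B
    u_b = conditional_utilization(acc_multi, acc_a)  # what B adds given A
    gap = u_a - u_b
    lr_a = lr_b = base_lr
    if gap > threshold:        # model leans on modality A
        lr_a = base_lr * reduction
    elif gap < -threshold:     # model leans on modality B
        lr_b = base_lr * reduction
    return lr_a, lr_b

# Made-up validation accuracies: modality A alone is far stronger than B alone.
lr_a, lr_b = balanced_learning_rates(1e-3, acc_multi=0.80, acc_a=0.78, acc_b=0.50)
```

In this toy call the gap between the two utilization rates exceeds the threshold, so only the learning rate for modality A is reduced while modality B keeps the base rate.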


Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

arXiv.org Artificial Intelligence

Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.